Dimension reduction

MACS 30000 University of Chicago

November 27, 2017

Unsupervised learning

  • Supervised learning
  • Unsupervised learning
    • Exploration

Dimension reduction

  • Visualize data with lots of variables
  • Use variables in supervised learning, but you have too many
  • Identify a small number of variables that collectively explain most of the variability in the original data

DW-NOMINATE

  • Studying legislative voting
  • Roll-call votes
  • Multidimensional scaling techniques
  • DW-NOMINATE dimensions
    • Ideology (liberal-conservative)
    • Issue of the day

Voteview.org


Principal components analysis

  • Find a low-dimensional representation of the data that contains as much of the variation as possible
  • Dimensions are linear combinations of the variables

First principal component

\[Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \dots + \phi_{p1}X_p\]

\[\sum_{j=1}^p \phi_{j1}^2 = 1\]

Estimation procedure

\[\max_{\phi_{11}, \dots, \phi_{p1}} \left \{ \frac{1}{n} \sum_{i=1}^n z_{i1}^2 \right \}\]

Such that \(\sum_{j=1}^p \phi_{j1}^2 = 1\)

Result of estimation

  • Loadings: \(\phi_{11}, \dots, \phi_{p1}\)
  • Scores: \(z_{11}, z_{21}, \dots, z_{n1}\)
  • Estimating second PC, third PC, etc.
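A minimal sketch of this estimation in pure Python: power iteration on the covariance matrix is one standard way to find the leading eigenvector, whose entries are the loadings \(\phi_{j1}\). The data values are hypothetical, chosen only to show two correlated variables.

```python
import math

# Toy data: two correlated variables (hypothetical values).
X = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
     (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
     (1.5, 1.6), (1.1, 0.9)]

n = len(X)
# Center each variable (PCA assumes mean-zero columns).
means = [sum(row[j] for row in X) / n for j in range(2)]
Xc = [[row[j] - means[j] for j in range(2)] for row in X]

# Covariance matrix with 1/n scaling, matching the slide's objective.
C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
      for b in range(2)] for a in range(2)]

# Power iteration: repeatedly multiply a unit vector by C;
# it converges to the leading eigenvector, i.e. the loadings phi.
phi = [1.0, 0.0]
for _ in range(100):
    v = [sum(C[a][b] * phi[b] for b in range(2)) for a in range(2)]
    norm = math.sqrt(sum(x * x for x in v))
    phi = [x / norm for x in v]   # enforces sum of phi_j1^2 = 1

# Scores: z_i1 = phi_11 * x_i1 + phi_21 * x_i2
scores = [sum(phi[j] * Xc[i][j] for j in range(2)) for i in range(n)]

print([round(p, 3) for p in phi])  # → [0.678, 0.735]
```

The second PC would come from repeating the iteration after projecting out the first component (deflation).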

Biplot of USArrests

Bag-of-words

 a abandoned abc ability able about above abroad absorbed absorbing abstract
43         0   0       0    0    10     0      0        0         0        1
  1. Sparsity
  2. Stop words
  3. Correlation between words
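A toy illustration of the representation and its sparsity problem; the document texts and stop-word list here are made up for the example (real stop-word lists contain hundreds of entries).

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
]

# Tiny hypothetical stop-word list.
stop_words = {"the", "on", "a"}

# Document-term counts: one Counter per document.
dtm = [Counter(w for w in doc.split() if w not in stop_words)
       for doc in docs]

# The shared vocabulary is the union of all terms; most entries in
# each document's row are zero -- this is the sparsity problem.
vocab = sorted(set().union(*dtm))
rows = [[c[t] for t in vocab] for c in dtm]
print(vocab)  # → ['ate', 'bone', 'cat', 'dog', 'mat', 'sat']
print(rows)   # → [[0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0]]
```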

Latent semantic analysis

  • Identify words closely related to one another
  • Makes searching by keyword easier
  • Uses PCA or similar techniques

NYTimes

##  [1] "penchant"  "brought"   "structure" "willing"   "yielding" 
##  [6] "bare"      "school"    "halls"     "challenge" "step"     
## [11] "largest"   "lovers"    "intense"   "borders"   "mall"     
## [16] "classic"   "conducted" "mirrors"   "hole"      "location" 
## [21] "desperate" "published" "head"      "paints"    "another"  
## [26] "starts"    "familiar"  "window"    "thats"     "broker"


Topic modeling

  • Themes
  • Probabilistic topic models
  • Latent Dirichlet allocation

Food and animals

  1. I ate a banana and spinach smoothie for breakfast.
  2. I like to eat broccoli and bananas.
  3. Chinchillas and kittens are cute.
  4. My sister adopted a kitten yesterday.
  5. Look at this cute hamster munching on a piece of broccoli.

LDA document structure

  • Decide on the number of words \(N\) the document will have
  • Generate each word in the document:
    • Pick a topic from the document's mixture of topics
    • Generate a word from that topic's distribution over words
  • LDA works backwards from this generative assumption to infer the topics

Food and animals

  • Decide that \(D\) will be 1/2 about food and 1/2 about cute animals.
  • Pick 5 to be the number of words in \(D\).
  • Pick the first word to come from the food topic
  • Pick the second word to come from the cute animals topic
  • Pick the third word to come from the cute animals topic
  • Pick the fourth word to come from the food topic
  • Pick the fifth word to come from the food topic
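The generative story above can be simulated directly. The topic names and word lists below are hypothetical stand-ins for the slide's food and animal vocabularies.

```python
import random

random.seed(0)

# Hypothetical word distributions for two topics.
topics = {
    "food":    ["broccoli", "banana", "smoothie", "spinach", "eat"],
    "animals": ["chinchilla", "kitten", "hamster", "cute", "adopted"],
}

# Document D: half about food, half about cute animals (as on the slide).
topic_mix = {"food": 0.5, "animals": 0.5}
N = 5  # number of words in D

doc = []
for _ in range(N):
    # Pick a topic according to the document's topic mixture...
    t = random.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
    # ...then generate a word from that topic's word distribution.
    doc.append(random.choice(topics[t]))

print(doc)
```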

How does LDA learn?

  • Randomly assign each word in the document to one of \(K\) topics
  • For each document \(d\):
    • Go through each word \(w\) in \(d\)
      • And for each topic \(t\), compute two things:
        1. \(p(t | d)\)
        2. \(p(w | t)\)
      • Reassign \(w\) a new topic \(t\) with probability \(p(t|d) \times p(w|t)\)
  • Rinse and repeat until the assignments reach a roughly steady state

  • Estimate from LDA
    1. The topic mixtures of each document
    2. The words associated with each topic
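The learning loop above can be sketched as a collapsed Gibbs sampler on the toy corpus. Dirichlet smoothing terms (alpha, beta) are added so the reassignment probabilities stay positive; their values here are arbitrary choices for the example, and stop words have been dropped from the five documents by hand.

```python
import random
from collections import defaultdict

random.seed(42)

# The five toy documents from the slides, minus obvious stop words.
docs = [
    ["banana", "spinach", "smoothie", "breakfast"],
    ["eat", "broccoli", "banana"],
    ["chinchilla", "kitten", "cute"],
    ["sister", "adopted", "kitten"],
    ["cute", "hamster", "munching", "broccoli"],
]

K = 2                   # number of topics, chosen up front
alpha, beta = 0.1, 0.1  # Dirichlet smoothing hyperparameters
V = len({w for d in docs for w in d})

# Step 1: randomly assign each word to one of K topics.
z = [[random.randrange(K) for _ in d] for d in docs]

# Count tables used to compute p(t | d) and p(w | t).
doc_topic = [[0] * K for _ in docs]                 # n(d, t)
topic_word = [defaultdict(int) for _ in range(K)]   # n(t, w)
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Step 2: repeatedly reassign each word with prob. p(t|d) * p(w|t).
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove this word's current assignment from the counts.
            doc_topic[d][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            # p(t|d) * p(w|t), with Dirichlet smoothing.
            weights = [
                (doc_topic[d][k] + alpha) *
                (topic_word[k][w] + beta) / (topic_total[k] + beta * V)
                for k in range(K)
            ]
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

# The two estimates: topic mixture of each document...
for d in range(len(docs)):
    n = sum(doc_topic[d])
    print(d, [round(c / n, 2) for c in doc_topic[d]])
# ...and the words most associated with each topic.
for k in range(K):
    top = sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:3]
    print("topic", k, top)
```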

USCongress

Document classification

Associated Press articles

## <<DocumentTermMatrix (documents: 2246, terms: 10134)>>
## Non-/sparse entries: 259208/22501756
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
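The summary's figures can be checked directly: with 2,246 documents and 10,134 terms the matrix has 2246 × 10134 cells, and all but 259,208 of them are zero.

```python
# Check the sparsity figures in the DocumentTermMatrix summary above.
documents, terms = 2246, 10134
nonsparse = 259208

total = documents * terms           # every (document, term) cell
sparse = total - nonsparse          # cells that are zero
print(total)                        # → 22760964
print(sparse)                       # → 22501756 (matches "sparse entries")
print(round(100 * sparse / total))  # → 99 (the reported "Sparsity: 99%")
```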

Perplexity

  • A statistical measure of how well a probability model predicts a sample
  • Given the theoretical word distributions represented by the topics, compare them to the actual distribution of words in your documents
  • Perplexity for LDA model with 12 topics
    • 2277.876
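A minimal illustration of the definition: perplexity is the exponential of the negative average log-likelihood per word, so lower values mean the model predicts the held-out words better. The per-word probabilities below are hypothetical; a real corpus has millions of tokens.

```python
import math

# Hypothetical probabilities a fitted model assigns to five held-out words.
word_probs = [0.01, 0.002, 0.005, 0.0004, 0.003]

log_likelihood = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_likelihood / len(word_probs))
print(perplexity)
```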

Topics from \(k=100\)